Abstract

Recently, One Step Mutation (OSM) matrices were introduced to model the impact of substitutions on arbitrary branches of a 
phylogenetic tree on an alignment site. This concept works nicely for the four-state nucleotide alphabet and provides 
an efficient procedure conjectured to compute the minimal number of substitutions needed to transform one 
alignment site into another. The present paper delivers a proof of the validity of this algorithm. Moreover, we provide 
several mathematical insights into the generalization of the OSM matrix to multi-state alphabets. The construction of 
the OSM matrix is only possible if the matrices representing the substitution types acting on the character states and 
the identity matrix form a commutative group with respect to matrix multiplication. We illustrate this approach by 
looking at Abelian groups over twenty states and critically discuss their biological usefulness when investigating 
amino acids. 
Background

Alignments of homologous sequences provide fundamen- 
tal materials to the reconstruction of phylogenetic trees 
and many other sequence-based analyses (see, e.g., [1,2]). 
Each alignment column (site) consists of character states 
that are assumed to have evolved from a common ances- 
tral state by means of substitutions. Any combination of 
the character states in the aligned sequences at one align- 
ment column represents a so-called character [3], which 
is sometimes also called site pattern [4]. Given a phy- 
logenetic tree and an alignment that evolved along the 
tree, Klaere et al. [5] showed, for binary alphabets, how a 
character changes into another character if a substitution 
occurs on an arbitrary branch of the tree. The impact of 
such a substitution is summarized by the so-called One 
Step Mutation (OSM) matrix. The OSM matrix allows for 
analytical formulae to compute the posterior probability 
distribution of the number of substitutions on a given tree 
that give rise to a character [5]. 
Nguyen et al. [4] extended the concept of the OSM 
matrix to the four-state nucleotide alphabet while developing 
a method, the MISFITS algorithm, to evaluate the 
goodness of fit between models and data in phyloge- 
netic inference. There, the OSM matrix is constructed 
based on the Kimura three parameter (K3ST) substitu- 
tion model [6]. Nguyen et al. [4] illustrated how one can 
apply the Fitch algorithm [7] to compute the minimal 
number of substitutions required to change one char- 
acter into another character under the OSM setting. In 
the present paper, we deliver a proof of the validity of 
this algorithm. 

In addition, the OSM matrix can be constructed only 
if the matrices representing the substitutions, the so- 
called substitution matrices, and the identity matrix form 
a commutative or Abelian group (see, e.g., [8]) with 
respect to matrix multiplication [4]. The link between 
Abelian groups and phylogenetic models has been studied 
before, most notably by Hendy et al. [9]. Further, 
an extension of nucleotide substitution models with an 
underlying Abelian group to joint states at the leaves of 
a tree has also been studied by other authors. Bashford 
et al. [10] introduced an approach very similar to 
OSM to study the multi-taxon tensor space. Bryant [11] 
also introduced a very similar framework to study the 
Hadamard transform of [12] in the light of multi-taxon 
processes. 

In this work, we first introduce standard phylogenetic 
notation. We then formalize the construction of the OSM 
matrix and identify which part of this construction is used in the 
MISFITS algorithm. We further present possible extensions 
of the OSM framework to arbitrary alphabets. 
We will show that the MISFITS algorithm in fact com- 
putes the minimal number of substitutions needed to 
change one character into another character. Moreover, 
we discuss the extension of the algorithm to substitu- 
tion models which do not have an underlying Abelian 
group. Finally, we discuss the Abelian groups available for 
amino acids. 



Notation and problem recapitulation 

Notation 

Recall that a rooted binary phylogenetic $X$-tree is a tree 
$T = (V(T), E(T))$ with the following properties: There 
is one vertex $\rho \in V(T)$ with indegree 0 and outdegree 
1, which is called the root of $T$. All edges $e \in E(T)$ are 
directed away from $\rho$, and all vertices $v \in V(T) \setminus \{\rho\}$ have 
indegree 1 and outdegree 0 or 2. Vertices with outdegree 
0 are usually referred to as leaves of $T$. Remember that for 
an $X$-tree, there are exactly $|X| = n$ leaves, which is why 
there is a bijection between the set of leaves of $T$ and the 
taxon set $X$. Thus, when there is no ambiguity, we use the 
terms leaf and taxon synonymously. Moreover, we often 
just write "phylogenetic tree" or "tree" when referring to a 
rooted binary phylogenetic tree. 

Furthermore, recall that a character $f$ is a function $f : X \to C$ 
for some set $C := \{c_1, c_2, c_3, \ldots, c_r\}$ of $r$ character 
states ($r \in \mathbb{N}$). We denote by $C^n$ the set of all $r^n$ 
possible characters on $C$ and $n$ taxa. For instance, for the 
four-state DNA alphabet, $C_{\mathrm{DNA}} = \{A, G, C, T\}$ and the set 
$C^n_{\mathrm{DNA}}$ consists of all $4^n$ possible characters. 

An extension of $f$ to $V(T)$ is a map $g : V(T) \to C$ 
such that $g(i) = f(i)$ for all $i$ in $X$. For such an extension 
$g$ of $f$, we denote by $l_T(g)$ the number of edges $e = \{u, v\}$ 
in $T$ on which a substitution occurs, i.e. where $g(u) \neq g(v)$. 
The parsimony score of $f$ on $T$, denoted by $l_T(f)$, is 
obtained by minimizing $l_T(g)$ over all possible extensions 
$g$. Given a tree $T$ and a character $f$ on the same taxon 
set, one can easily calculate the parsimony score of $f$ on 
$T$ with the famous Fitch algorithm [7]. Moreover, when 
a character state changes along one edge of the tree, we 
refer to this state change as substitution or mutation. As 
for our purposes only so-called manifest mutations are 
relevant, i.e. those mutations that can be observed and 
are not reversed, we do not distinguish between mutations 
and substitutions, which is why we use these terms 
synonymously. 
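
The parsimony score defined above can be computed with a few lines of code. The following is a minimal sketch of the bottom-up pass of the Fitch algorithm, not code from the cited works. It assumes a tree encoded as nested pairs of leaf labels; the five-taxon topology `((1, 2), (3, (4, 5)))` and the function name `fitch` are chosen purely for illustration, and the edge above the topmost binary vertex is ignored because it never contributes to the unconstrained score.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a rooted binary tree.

    tree   -- a leaf label, or a pair (left, right) of subtrees
    states -- dict mapping each leaf label to its character state
    Returns (set of optimal states at this vertex, parsimony score).
    """
    if not isinstance(tree, tuple):            # a leaf carries its own state
        return {states[tree]}, 0
    left_set, left_score = fitch(tree[0], states)
    right_set, right_score = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                 # children agree: no new substitution
        return common, left_score + right_score
    # children disagree: take the union and count one substitution
    return left_set | right_set, left_score + right_score + 1


# Hypothetical topology, used only for illustration.
tree = ((1, 2), (3, (4, 5)))
f = dict(zip([1, 2, 3, 4, 5], "GTAGA"))
root_set, score = fitch(tree, f)
print(sorted(root_set), score)   # ['A', 'G', 'T'] 3
```

On this assumed topology the character GTAGA has parsimony score 3.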



Construction of the OSM matrix 

We now introduce the OSM framework in a stepwise fash- 
ion. The aim of the OSM approach is to determine the 
effects a single mutation occurring on a rooted tree T has 
on a character evolving on that tree. 

The first task of this approach is to formalize the term 
mutation and its effects on a single character state in $C$. A 
mutation is an operation $\sigma : C \to C$ which is bijective, i.e. 
it satisfies the following condition: 

C1. For all $c_i \in C$ there is a $c_j \in C$ such that $\sigma(c_i) = c_j$, 
and if $\sigma(c_i) = \sigma(c_j)$, then $c_i = c_j$. 

This guarantees that a mutation affects a character state 
in a unique fashion. It is well-known that any bijective 
function on a finite discrete state set is a permutation 
(e.g., [13]). Thus, a mutation is a specific instance of a 
permutation applied to a character. 

The next step is to select the set $\Sigma$ of admissible permutations 
acting on $C$. It is mathematically convenient to 
select $\Sigma$ such that it forms an Abelian group [9] with a 
regular (transitive and free) action on $C$. Hence, $\Sigma$ satisfies 
the following conditions: 

C2. For every pair $c_i, c_j \in C$ there is exactly one 
permutation $\sigma \in \Sigma$ such that $\sigma(c_i) = c_j$, i.e., the 
action of $\Sigma$ on $C$ is regular. 

C3. For all $\sigma_1, \sigma_2 \in \Sigma$ also the product $\sigma_1 \circ \sigma_2 \in \Sigma$. 
Mathematically speaking, $\Sigma$ is closed with respect to 
concatenation of its permutations. 

C4. For all $\sigma_1, \sigma_2 \in \Sigma$ we have $\sigma_1 \circ \sigma_2 = \sigma_2 \circ \sigma_1$. Thus, $\Sigma$ 
is commutative, and hence the order in which we 
assign permutations is irrelevant for the outcome. 

C5. There is an element $\sigma_0 \in \Sigma$ such that for all $\sigma_1 \in \Sigma$ 
we have $\sigma_1 \circ \sigma_0 = \sigma_0 \circ \sigma_1 = \sigma_1$, i.e. there exists a 
so-called neutral element, namely the identity, in $\Sigma$. 
For all $c_i \in C$ only $\sigma_0(c_i) = c_i$, i.e. $\sigma_j$ is fixed point free 
for all $\sigma_j \neq \sigma_0$. 

C6. For every $\sigma_1 \in \Sigma$ there exists a $\sigma_2 \in \Sigma$ such that 
$\sigma_1 \circ \sigma_2 = \sigma_0$. Mathematically speaking, for every 
element of $\Sigma$ there exists an inverse element. This 
guarantees that every permutation can be reversed 
within a single step. 

C7. For all $\sigma_1, \sigma_2, \sigma_3 \in \Sigma$ we have 
$\sigma_1 \circ (\sigma_2 \circ \sigma_3) = (\sigma_1 \circ \sigma_2) \circ \sigma_3 = \sigma_1 \circ \sigma_2 \circ \sigma_3$, i.e. the 
associative law holds. 

It should be noted that the composition of permutations is always 
associative, i.e. any set of permutations satisfies C7. Thus, for a set 
of permutations $\Sigma$ to be Abelian with a regular action on $C$ it only 
needs to satisfy C1-C6. 

In the following, we consider the matrix representation 
of permutations. A permutation matrix over $C$ is an $r \times r$ 
matrix such that $\sigma_{c_i c_j} = 1$ if $\sigma(c_i) = c_j$, and 0 otherwise. 
We consider it equivalent to discuss a permutation 
or its corresponding matrix. Therefore, concatenation "$\circ$" 
is equivalent to the matrix multiplication "$\cdot$". We use $\sigma$ to 
denote a permutation or a permutation matrix, depending 
on the context. 

Example 1. In genetics, the most commonly used character 
state set is $C_{\mathrm{DNA}} = \{A, G, C, T\}$. There are two 
different Abelian groups for four states, namely the Klein 
four-group $\mathbb{Z}_2 \times \mathbb{Z}_2$ and the cyclic group $\mathbb{Z}_4$. The Klein 
four-group is constructed from the cyclic group $\mathbb{Z}_2$ over 
two elements, the identity $\tau_0$ and the flip $\tau_1$. These take the 
matrix form 

$$\tau_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \tau_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

The Klein four-group consists of the four Kronecker 
products of these two matrices, i.e. $s_0 = \tau_0 \otimes \tau_0$, $s_1 = \tau_1 \otimes \tau_0$, 
$s_2 = \tau_0 \otimes \tau_1$, and $s_3 = \tau_1 \otimes \tau_1$. The Kronecker 
products here yield $4 \times 4$ matrices, e.g., 

$$s_1 = \tau_1 \otimes \tau_0 = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & 0 & 0 & 1 & 0 \\ C & 0 & 0 & 0 & 1 \\ G & 1 & 0 & 0 & 0 \\ T & 0 & 1 & 0 & 0 \end{array}$$

The set $\Sigma_{\mathrm{K3ST}} := \{s_0, s_1, s_2, s_3\}$ coincides with the substitution 
matrices under the Kimura 3ST model [6]. In 
particular, $s_1$ describes transitions within purines (A, G) 
and pyrimidines (C, T), $s_2$ represents transversions within 
pairs (A, C) and (G, T), and $s_3$ represents the remaining set 
of transversions within pairs (A, T) and (C, G). 

The second Abelian group over four states, the cyclic 
group $\mathbb{Z}_4$, is formed by selecting a 4-cycle, e.g., $A \to G \to T \to C \to A$, 
and concatenating this cycle with itself. The 
resulting set of permutations $\Sigma_{\mathbb{Z}_4}$ contains the following 
elements: 

$$s'_1 = \begin{array}{c|cccc} & A & C & G & T \\ \hline A & 0 & 0 & 1 & 0 \\ C & 1 & 0 & 0 & 0 \\ G & 0 & 0 & 0 & 1 \\ T & 0 & 1 & 0 & 0 \end{array}, \qquad s'_2 = s'_1 \cdot s'_1, \qquad s'_3 = (s'_1)^3, \qquad s'_0 = (s'_1)^4 = s_0.$$

Note that there are actually six different four-cycles for 
$C_{\mathrm{DNA}}$. These result in three distinguishable Abelian groups. 
Bryant [14] generates his cyclic group with the four-cycle 
$A \to C \to G \to T \to A$, and shows that the resulting 
set $\Sigma_{\mathrm{K2ST}}$ underlies the Kimura 2ST model [15], where 
$s_2$ corresponds to the transition within purines and pyrimidines, 
and $s_1$ and $s_3$ are the (not further distinguished) 
transversions. 
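
The two groups of Example 1 can be built and checked numerically. The following is a minimal sketch, assuming NumPy and the state order (A, C, G, T) used for the matrices above; the helper names `key` and `is_regular_abelian_group` are hypothetical and introduced only for this illustration. It tests conditions C2-C6 (C1 and C7 hold for any set of permutation matrices).

```python
import itertools
import numpy as np

def key(m):
    """Hashable fingerprint of an integer matrix."""
    return np.asarray(m, dtype=np.int64).tobytes()

def is_regular_abelian_group(mats):
    """Check conditions C2-C6 for a list of r x r permutation matrices."""
    r = mats[0].shape[0]
    keys = {key(m) for m in mats}
    # C2: every ordered pair of states is linked by exactly one permutation.
    regular = np.array_equal(sum(mats), np.ones((r, r), dtype=int))
    # C3: closure under matrix multiplication.
    closed = all(key(a @ b) in keys for a, b in itertools.product(mats, repeat=2))
    # C4: commutativity.
    commutative = all(np.array_equal(a @ b, b @ a)
                      for a, b in itertools.combinations(mats, 2))
    # C5: the identity is present (fixed point freeness follows from C2).
    has_identity = key(np.eye(r, dtype=int)) in keys
    # C6: the inverse of a permutation matrix is its transpose.
    has_inverses = all(key(m.T) in keys for m in mats)
    return regular and closed and commutative and has_identity and has_inverses

# Klein four-group from Kronecker products of the identity and the flip.
tau0 = np.eye(2, dtype=int)
tau1 = np.array([[0, 1], [1, 0]])
klein = [np.kron(a, b) for a, b in
         [(tau0, tau0), (tau1, tau0), (tau0, tau1), (tau1, tau1)]]

# Cyclic group from the 4-cycle A -> G -> T -> C -> A (rows/columns: A, C, G, T).
s1_prime = np.array([[0, 0, 1, 0],
                     [1, 0, 0, 0],
                     [0, 0, 0, 1],
                     [0, 1, 0, 0]])
cyclic = [np.linalg.matrix_power(s1_prime, k) for k in range(4)]

print(is_regular_abelian_group(klein), is_regular_abelian_group(cyclic))
```

Both sets pass the check. One visible difference is that every element of the Klein four-group is a symmetric matrix (each substitution is its own inverse), whereas $s'_1$ in the cyclic group is not.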

The next step in constructing the OSM matrix is to construct 
a set $\Sigma^T$ of operations over $C^n$ governed by $T$, 
and based on the permutation set $\Sigma$. To this end, we first 
define $\Sigma^n$ as a set of operations which work elementwise, 
i.e. for $f = (f_1, \ldots, f_n) \in C^n$ and $\sigma \in \Sigma^n$ we have 

$$\sigma(f) := (\sigma_1(f_1), \ldots, \sigma_n(f_n)), \quad \sigma_i \in \Sigma.$$

This can also be described by the Kronecker product, 
i.e. equally 

$$\sigma := \sigma_1 \otimes \cdots \otimes \sigma_n. \qquad (1)$$

This means that there are $r^n$ different operators in $\Sigma^n = \Sigma \otimes \cdots \otimes \Sigma$. 

Remark 1. Therefore, for any pair of characters $f, g \in C^n$ 
we can find an operation $\sigma \in \Sigma^n$ such that $\sigma(f) = g$. 

Another noteworthy consequence of using the 
Kronecker product is that the elements of $\Sigma^n$ are permutations 
over $C^n$ [16,17], and in fact $\Sigma^n$ satisfies our 
Conditions C1-C7, i.e. $\Sigma^n$ is an Abelian group over $C^n$. 

In the OSM framework we assume that the permutations 
acting on a character $f \in C^n$ are derived from the 
underlying rooted tree $T$. If permutation $\sigma_i \in \Sigma$ acts 
on the pendant edge leading to taxon $j \in X$, then the 
associated permutation matrix $\sigma^{j,i}$ acting on $C^n$ has the 
form 

$$\sigma^{j,i} := \Big(\bigotimes_{l=1}^{j-1} \sigma_0\Big) \otimes \sigma_i \otimes \Big(\bigotimes_{l=j+1}^{n} \sigma_0\Big).$$

If a permutation acts on an interior edge $e$, then it simultaneously 
acts on the states of all descendant taxa of $e$, i.e. 
all those taxa whose path to the root passes $e$. E.g., assume 
taxa 1 and 2 form a cherry, i.e. their most recent common 
ancestor, 12, has no other descendants, and permutation 
$\sigma_i \in \Sigma$, $i = 1, \ldots, r - 1$, is acting on the edge leading to 
this ancestor. Then, we get the permutation 

$$\sigma^{12,i} := \sigma_i \otimes \sigma_i \otimes \sigma_0 \otimes \cdots \otimes \sigma_0 = \sigma^{1,i} \cdot \sigma^{2,i}. \qquad (2)$$

This shows in particular that a Kronecker product 
of some permutations acting on each character state is 
equivalent to the matrix product of the permutations acting 
on the entire character. The right hand side equation 
shows that a single permutation on an internal edge has 
the same effect as simultaneously applying the same permutation 
on the pendant edges of all descendant taxa. In 
other words, if $de(e)$ denotes the set of descendants of 
edge $e$, and $\sigma_i \in \Sigma$, then 

$$\sigma^{e,i} = \prod_{j \in de(e)} \sigma^{j,i}. \qquad (3)$$
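
Equation (2) can be checked numerically for a small case. The following is a minimal sketch, assuming NumPy, the K3ST transition $s_1$ from Example 1, three taxa, and the state order (A, C, G, T); the helper `pendant` is a hypothetical name for the construction of $\sigma^{j,i}$.

```python
import numpy as np

tau0 = np.eye(2, dtype=int)
tau1 = np.array([[0, 1], [1, 0]])
s0 = np.kron(tau0, tau0)          # identity on {A, C, G, T}
s1 = np.kron(tau1, tau0)          # transition A <-> G, C <-> T

def pendant(j, s, n):
    """sigma^{j,i}: s acts on taxon j (1-based), the identity on all others."""
    out = np.eye(1, dtype=int)
    for k in range(1, n + 1):
        out = np.kron(out, s if k == j else s0)
    return out

n = 3
# s1 on the interior edge above the cherry formed by taxa 1 and 2.
cherry = np.kron(np.kron(s1, s1), s0)
# The same permutation applied to both pendant edges of the cherry.
product = pendant(1, s1, n) @ pendant(2, s1, n)
print(np.array_equal(cherry, product))    # True

# Row 0 is the character AAA; its image is GGA, i.e. index 16*2 + 4*2 + 0.
print(int(cherry[0].argmax()))            # 40
```

The equality follows from the mixed-product property of the Kronecker product, which is what allows a substitution on an interior edge to be replaced by identical substitutions on all pendant edges below it.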

Note that the set $\Sigma^X$ of all permutations acting on the 
pendant edges is a generator of $\Sigma^n$, i.e. the closure of $\Sigma^X$ 
contains all permutations in $\Sigma^n$. Since $\Sigma^n$ contains a single 
permutation to transform character $f \in C^n$ into $g \in C^n$, 
and since $\Sigma^X$ generates $\Sigma^n$, there is a shortest chain of 
permutations in $\Sigma^X$ which transforms $f$ into $g$. $\Sigma^X$ is also 
the set of permutations implied by the star tree for $X$. In 
general, the set of all permutations on tree $T$ is 

$$\Sigma^T = \{\sigma^{e,i} : e \in E(T),\ i \in \{0, \ldots, r - 1\}\},$$

where $r$ is the number of permutations in $\Sigma$ (equivalently, the number of states in $C$). 

For every $X$-tree $T$ we have $\Sigma^T \supseteq \Sigma^X$, and therefore 
$\Sigma^T$ is a generator for $\Sigma^n$, too. An illustration of such a 
generator set $\Sigma^T$ over the character set $C^n$ is the so-called 
Cayley graph [18], which has as vertices the characters of 
$C^n$, and two characters $f, g \in C^n$ are connected if there is 
a permutation $\sigma \in \Sigma^T$ such that $\sigma(f) = g$. In [5] Cayley 
graphs have been presented as alternative illustrations of 
the tree $T$ over a binary state set $C = \{0, 1\}$. 

Example 2. Regard the K3ST model from Example 1 
and the rooted two-taxon tree depicted in Figure 1a. With 
this, $\Sigma^T_{\mathrm{K3ST}}$ is given by the set 

$$\begin{array}{lll} s^{e_1,1} := s_1 \otimes s_0, & s^{e_2,1} := s_0 \otimes s_1, & s^{e_{12},1} := s_1 \otimes s_1, \\ s^{e_1,2} := s_2 \otimes s_0, & s^{e_2,2} := s_0 \otimes s_2, & s^{e_{12},2} := s_2 \otimes s_2, \\ s^{e_1,3} := s_3 \otimes s_0, & s^{e_2,3} := s_0 \otimes s_3, & s^{e_{12},3} := s_3 \otimes s_3. \end{array}$$

Each permutation which acts on the characters is thus a 
symmetric $16 \times 16$ permutation matrix depicting a transition 
($s^{e,1}$), transversion 1 ($s^{e,2}$), or transversion 2 ($s^{e,3}$) 
along edge $e \in E(T)$. Figures 1b-d display the permutation 
matrices for a transition on branch $e_1$ ($s^{e_1,1}$), $e_2$ ($s^{e_2,1}$) and 
$e_{12}$ ($s^{e_{12},1}$), respectively. Figure 1e shows the Cayley graph 
associated with $\Sigma^T_{\mathrm{K3ST}}$. 

We are now in a position to recall the definition of the 
OSM matrix $M_T$ for a rooted binary phylogenetic tree $T$ 
as explained in [5] and [19]. For an edge $e \in E(T)$ we 
denote by $p_e$ the relative branch length of $e$, i.e. its actual 
branch length (expected number of substitutions per site) 
divided by the length of $T$ (the sum of all branch lengths). 
Thus, one can view $p_e$ as the probability that a mutation 
is observed at edge $e$ assuming that a single mutation 
occurred on $T$. Clearly, $\sum_{e \in E(T)} p_e = 1$. Further, denote 
by $\alpha_{e,i}$ the probability that this mutation on $e$ is of type 
$i \in \{1, \ldots, r - 1\}$ with $\sum_{i=1}^{r-1} \alpha_{e,i} = 1$ for all $e \in E(T)$. Then 
the OSM matrix is the convex sum of the elements in $\Sigma^T$, 
where each permutation $\sigma^{e,i}$ is multiplied by $\alpha_{e,i}\, p_e$, the 
probability of hitting the edge $e$ with permutation $\sigma_i \in \Sigma$. 
Thus, we obtain: 

$$M_T = \sum_{e \in E(T)} \sum_{i=1}^{r-1} \alpha_{e,i}\, p_e\, \sigma^{e,i}. \qquad (4)$$

$M_T$ can be regarded as the weighted exchangeability 
matrix for all characters under the K3ST model assuming 
that a single substitution occurs on the tree $T$. Figure 1f 
depicts the OSM matrix for the tree in Figure 1a. Here, 
colors indicate relative branch lengths $p_e$, and patterns 
denote permutation types $\sigma_i$. E.g., a blue square with 
horizontal lines indicates the product $p_{e_2} \alpha_{e_2,1}$, i.e. the 
probability of observing a transition $s_1$ on edge $e_2$. 
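
The construction of $M_T$ for the two-taxon tree of Example 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes NumPy, the state order (A, C, G, T), and made-up values for the relative branch lengths `p` and the type probabilities `alpha` (taken to be the same on every edge for brevity).

```python
import numpy as np

tau0 = np.eye(2, dtype=int)
tau1 = np.array([[0, 1], [1, 0]])
# s[0] is the identity; s[1], s[2], s[3] are the K3ST substitution types.
s = [np.kron(a, b) for a, b in
     [(tau0, tau0), (tau1, tau0), (tau0, tau1), (tau1, tau1)]]

def edge_perms(i):
    """The three 16 x 16 permutations of type i on the two-taxon tree."""
    return {"e1": np.kron(s[i], s[0]),
            "e2": np.kron(s[0], s[i]),
            "e12": np.kron(s[i], s[i])}

# Hypothetical parameters, chosen only for illustration.
p = {"e1": 0.5, "e2": 0.3, "e12": 0.2}     # relative branch lengths, sum to 1
alpha = {1: 0.6, 2: 0.3, 3: 0.1}           # type probabilities, sum to 1

# Convex sum over all edges and all non-identity substitution types.
M = np.zeros((16, 16))
for i in (1, 2, 3):
    for e, perm in edge_perms(i).items():
        M += alpha[i] * p[e] * perm

print(np.allclose(M.sum(axis=0), 1), np.allclose(M.sum(axis=1), 1))
# Entry (AA, GA): a transition s1 on edge e1, with index of GA equal to 4*2 + 0.
print(round(M[0, 8], 10))
```

With these parameters the entry for the pair (AA, GA) is $\alpha_{e_1,1}\, p_{e_1} = 0.6 \cdot 0.5$, and all rows and columns of `M` sum to one.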

The transformation problem 

With the construction of $\Sigma^T$ we have generated the tools 
needed to formally describe the computations in Step 4 of 
the MISFITS algorithm [4]. Given a rooted tree $T$ and two 
characters $f$ and $f^d$ in $C^n$, we want to compute the minimal 
number of substitutions required on the tree to convert $f$ 
into $f^d$. Nguyen et al. [4] presented an efficient procedure to compute 
this minimal number of substitutions. 

Algorithm 1 

INPUT: rooted binary phylogenetic tree $T$ on leaf set $X$, 
characters $f$ and $f^d$ on $X$, Abelian group $\Sigma$. 
STEP 1: Using Remark 1, find the substitution type 
$\sigma_j$ which translates $f_j$ into $f^d_j$ for all positions 
$j = 1, \ldots, |X|$. Let $\sigma \in \Sigma^n$ be the resulting operation, 
i.e. $\sigma(f) = f^d$. 
STEP 2: Let $c := c_1 \ldots c_1$ be a constant character on 
$X$ with $c_1 \in C$. Let $h := \sigma(c)$. 
STEP 3: Calculate $m := l_T(h)$. 
OUTPUT: m. 
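
The three steps of Algorithm 1 can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the MISFITS implementation: the group is the K3ST group written as state maps, the five-taxon topology `((1, 2), (3, (4, 5)))` is hypothetical, and the root state is handled as in the worked example of the paper, i.e. one extra substitution is counted on the edge below the root whenever the constant state $c_1$ is not among the optimal Fitch states.

```python
# K3ST substitution types as state maps (type 0 is the identity).
K3ST = {
    0: dict(zip("ACGT", "ACGT")),
    1: dict(zip("ACGT", "GTAC")),   # transitions A <-> G, C <-> T
    2: dict(zip("ACGT", "CATG")),   # transversions A <-> C, G <-> T
    3: dict(zip("ACGT", "TGCA")),   # transversions A <-> T, C <-> G
}

def fitch(tree, states):
    """Bottom-up Fitch pass; returns (optimal state set, score)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_score = fitch(tree[0], states)
    right_set, right_score = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_score + right_score
    return left_set | right_set, left_score + right_score + 1

def min_substitutions(tree, f, fd, group, root_state="A"):
    # STEP 1: substitution type needed at each position (unique by regularity).
    sigma = [next(i for i, perm in group.items() if perm[a] == b)
             for a, b in zip(f, fd)]
    # STEP 2: apply the same operation to the constant character.
    h = "".join(group[i][root_state] for i in sigma)
    # STEP 3: parsimony score of h with the root fixed to the constant state.
    root_set, m = fitch(tree, dict(enumerate(h, start=1)))
    if root_state not in root_set:
        m += 1                      # one substitution on the edge below the root
    return sigma, h, m

tree = ((1, 2), (3, (4, 5)))        # hypothetical topology
print(min_substitutions(tree, "GTAGA", "ACCTC", K3ST))
# ([1, 1, 2, 2, 2], 'GGCCC', 2)
```

For this assumed topology the sketch reports two substitutions for converting GTAGA into ACCTC: a transition on the edge above taxa 1 and 2, and a transversion of type 2 on the edge above taxa 3, 4 and 5.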

We prove the correctness of this algorithm below. In our 
framework, $m$ corresponds to the minimum number of 
permutations $\sigma_1, \ldots, \sigma_m \in \Sigma^T$ whose concatenation 
$\sigma_1 \circ \cdots \circ \sigma_m$ maps $f$ to $f^d$. In this form, $m$ has multiple 
equivalent interpretations. 
It is the length of the shortest path between $f$ and $f^d$ in 
the Cayley graph for $\Sigma^T$, where this path corresponds to 
$\sigma_1 \circ \cdots \circ \sigma_m$. Further, $m$ corresponds to the minimum 
power $k$ of $M_T$ such that $M_T^j(f, f^d) = 0$ for $j < k$ and 
$M_T^k(f, f^d) > 0$, because a positive entry in $M_T^k$ means that 
there is a concatenation of $k$ permutations connecting the 
associated characters. 
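
The shortest-path interpretation can be checked independently of the Fitch algorithm by a breadth-first search on the Cayley graph. The following is a minimal sketch under the same assumptions as before (K3ST group, hypothetical topology `((1, 2), (3, (4, 5)))`); the tree enters only through `CLADES`, the sets of leaf positions below each edge, including the edge below the root.

```python
from collections import deque

# Non-identity K3ST substitution types as state maps.
K3ST = [dict(zip("ACGT", "GTAC")),
        dict(zip("ACGT", "CATG")),
        dict(zip("ACGT", "TGCA"))]

# Leaf positions (0-based) below each edge of the hypothetical tree
# ((1,2),(3,(4,5))), including the edge directly below the root.
CLADES = [(0,), (1,), (2,), (3,), (4,),
          (0, 1), (3, 4), (2, 3, 4), (0, 1, 2, 3, 4)]

def cayley_distance(f, fd, clades=CLADES, types=K3ST):
    """Length of a shortest path from f to fd in the Cayley graph of Sigma^T."""
    dist = {f: 0}
    queue = deque([f])
    while queue:
        g = queue.popleft()
        if g == fd:
            return dist[g]
        for clade in clades:            # choose an edge of the tree
            for perm in types:          # choose a substitution type
                h = list(g)
                for j in clade:         # it acts on all descendant taxa
                    h[j] = perm[h[j]]
                h = "".join(h)
                if h not in dist:
                    dist[h] = dist[g] + 1
                    queue.append(h)
    return None

print(cayley_distance("GTAGA", "ACCTC"))   # 2
```

The search visits at most $4^5 = 1024$ characters, so it is only practical for small examples, but it gives the same value as the parsimony-based computation for this case.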
Figure 1 Construction of the OSM matrix. (a) A rooted tree with taxa 1 and 2. (b) A transition $s_1$ on the left branch $e_1$ (the red branch) changes a 
character into exactly one new character as depicted by the red horizontal stripe cells of the permutation matrix $\sigma^{e_1, s_1}$. The matrix has 16 rows and 
16 columns representing the possible characters for the alignment of two nucleotide sequences. The permutation matrices generated by $s_1$ for the 
right branch $e_2$ (blue) and for the branch leading to the "root" $e_{12}$ (green) are displayed in (c) and (d), respectively. The corresponding Cayley graph 
for the tree is illustrated in (e). The convex sum of all the weighted (by the relative branch length and the probability of the substitution type) 
permutation matrices generated by all substitution types for all branches is the OSM matrix of the tree ($M_T$) as shown in (f). Horizontal stripe cells 
represent the probability of the transition $s_1$; diagonal stripes the transversion $s_2$; and thin reverse diagonal stripes the transversion $s_3$. The colors of 
these cells indicate the relative branch lengths and follow the colors of the branches as in (a). Thus, these colors also depict the branch origin of the 
substitutions. 
Example 3. Figure 2 demonstrates how Algorithm 1 
works under the K3ST model, i.e. when the group is $\Sigma = \Sigma_{\mathrm{K3ST}}$ 
(Figure 2a). Consider the rooted five-taxon tree in 
Figure 2b and the character GTAGA at the leaves. Assume 
that the character GTAGA is to be converted into character 
ACCTC. By comparing the two characters position-wise, 
we need a substitution $s_1$ on the external branch leading 
to taxon 1 to convert G into A at the first position. Similarly, 
we need a substitution $s_1$ on the external branch 
leading to taxon 2, and a substitution $s_2$ on every external 
branch leading to taxa 3, 4, and 5. Thus, the operation 
$s := (s_1, s_1, s_2, s_2, s_2)$ transfers the character GTAGA into 
the character ACCTC. As the operation $s$ also translates 
the constant character AAAAA into GGCCC, converting 
GTAGA into ACCTC is equivalent to evolving the character 
state A at the root along the tree to obtain the 
character GGCCC at the leaves. The Fitch algorithm [7] 
applied to the character GGCCC with the constraint that 
the character state at the root is A produces a unique most 
parsimonious solution of two substitutions as depicted 
by Figure 2c. 

Results 

The impact of parsimony on the estimation of substitutions 

In this section, we provide some mathematical insights 
into the role of maximum parsimony in the estimation of 
the number of substitutions needed to convert a character 
into another one as explained above. In particular, we 
deliver a proof for Algorithm 1. 

Theorem 1. Let $T$ be a rooted binary phylogenetic tree 
on taxon set $X$ and let $f$ be a character that evolved on $T$ 
due to some evolutionary model and let $f^d$ be another character 
on $X$. Then, the minimum number of substitutions to 
be put on $T$ which change the evolution of $f$ in such a way 
that $f^d$ is generated can be calculated with Algorithm 1. 

Proof. Let $f$, $f^d$, $X$, $T$ and $\Sigma$ be as required for the input 
of Algorithm 1. Then, as defined in the algorithm, we have 
$\sigma(f) = (\sigma_1(f_1), \sigma_2(f_2), \ldots, \sigma_n(f_n)) = f^d$, where $\sigma_j \in \Sigma$ 
refers to the substitution type needed to translate $f_j$ into 
$f^d_j$. Considering the underlying tree $T$, we may assume 
$\sigma_1, \ldots, \sigma_n$ act on the pending branches leading to taxa 
$1, \ldots, n$, respectively. 

Now we show that it is equivalent to consider $\sigma(c)$, 
where $c$ is a constant character, instead of $\sigma(f)$. Let $\mu \in \Sigma^T$ 
be a transformation with $\mu(f) = f^d$. Then, 

$$\sigma^{-1} \circ \mu(f) = \sigma^{-1}(f^d) = f. \qquad (5)$$

Next, let $\tilde{\sigma} \in \Sigma^T$ be such that $\tilde{\sigma}(c) = f$. Then, using (5), 
we have 

$$\sigma^{-1} \circ \mu \circ \tilde{\sigma}(c) = \sigma^{-1} \circ \mu(f) = f = \tilde{\sigma}(c).$$

On the other hand, we can use the commutativity of the 
underlying Abelian group to derive 

$$\sigma^{-1} \circ \mu \circ \tilde{\sigma}(c) = \tilde{\sigma} \circ \sigma^{-1} \circ \mu(c).$$

So altogether we have 

$$\sigma^{-1} \circ \mu \circ \tilde{\sigma}(c) = \tilde{\sigma} \circ \sigma^{-1} \circ \mu(c) = \tilde{\sigma}(c)$$

and therefore $\sigma^{-1} \circ \mu(c) = c$ and thus $\mu(c) = \sigma(c)$. As 
$\mu$ was arbitrarily chosen, this implies that any transformation 
which maps $f$ to $\sigma(f) = f^d$ also maps $c$ to $\sigma(c)$. 
Therefore, we have 

$$\{\mu \in \Sigma^T : \mu(f) = f^d\} = \{\mu \in \Sigma^T : \mu(c) = \sigma(c)\}.$$



(a) 




(b) 



T 
C 



A 
C 



1 2 3 4 5 



G 
T 



A 
C 



(c) 

G G 



C C 



12 3 4 



® 



Figure 2 Computing the minimal number of substitutions to translate a character into another one. (a) depicts the Klein-four group £ K 3st, 
which consists of the identity So and the three substitution types S],S2, S3 from the K3ST model, (b) In order to convert the character GTAGA into 
ACCTC under Xk3st, we need to introduce the operation s := (51,51,52,52,52). As the operation 5 also translates the constant character AAAAA to 
GGCCC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the tree to obtain the character GGCCC at the 
leaves. The Fitch algorithm applied to the latter produces a unique most parsimonious solution of two substitutions as depicted by (c). 
The minimum number of substitutions to change $f$ into 
$f^d$ on $T$ is attained by an element of the first set consisting of the 
fewest compositions. As the two sets are equal, 
we can investigate the second set rather than the first. So 
we need an element of the second set which consists of as 
few compositions as possible. Assuming that $\sigma = \sigma_1 \otimes \cdots \otimes \sigma_n$, 
we can assign $\sigma_1, \ldots, \sigma_n$ to the pending branches 
of $T$ and treat them like character states to which we then 
apply the Fitch algorithm. This completes the proof. □ 

Informally speaking, the idea is as follows: As there is 
exactly one path from the root $\rho$ to any taxon $x \in X$, we 
wish to determine whether we can 'pull up' some of the 
operations along this path in order to affect more than one 
taxon and still give the same result. This idea has been 
described above (Equations (2) and (3)), and it coincides 
precisely with the idea of the parsimony principle. 

However, in order to avoid confusion regarding the 
operation $\sigma$ as a character on which to apply parsimony, 
Algorithm 1 instead acts on the constant character. 
Clearly, in order to evolve the constant character $c := c_1 \cdots c_1$ 
on a tree with root state $c_1$, the corresponding 
operation would be $\sigma := \sigma_0 \otimes \cdots \otimes \sigma_0$. Note that $\sigma(c) = h$ 
and $\sigma(f) = f^d$, and that two character states in $h$ are identical 
if and only if the corresponding substitutions in $\sigma$ are 
identical, too. Therefore, it is possible to let MP act on $h$ 
rather than directly on $\sigma$. 

By the definition of maximum parsimony, when applied 
to $h$ on tree $T$ with given root state $c_1$, it calculates the 
minimum number $m$ of substitutions to explain $h$ on $T$. 
This number $m$ is therefore precisely the number of substitutions 
needed to generate $h$ on $T$ rather than $c$. As 
$\sigma(f) = f^d$, $m$ also is the number of substitutions needed 
to generate $f^d$ from $f$ on $T$. 



The impact of different groups 

For any alphabet C, there might be more than one Abelian 
group. Different groups might result in different num- 
bers of substitutions required to translate a character into 
another character. We illustrate this observation using the 
following example. 

Example 4. Recall the starting point of Example 3, i.e. 
regard the five-taxon tree $T$ from Figure 3b, and the characters 
$f = \mathrm{GTAGA}$ and $f^d = \mathrm{ACCTC}$. Now, instead of 
using $\Sigma_{\mathrm{K3ST}}$ we use the permutations from the cyclic group 
$\Sigma_{\mathbb{Z}_4}$. In this setting, we need a substitution $s'_3$ (blue in 
Figure 3a) on the external edge leading to taxon 1 to convert 
G into A at the first position, and so on. Thus, we 
get the operation $s' := (s'_3, s'_1, s'_3, s'_1, s'_3)$ such that $s'(f) = f^d$. 
We immediately see that $s'$ transforms the constant 
character $c = \mathrm{AAAAA}$ into $h = \mathrm{CGCGC}$. The Fitch 
algorithm applied to the character CGCGC with the constraint 
that the character state at the root is A produces a 
unique most parsimonious solution of three substitutions 
as depicted by Figure 3c. Thus, under the $\Sigma_{\mathbb{Z}_4}$ group we 
need one substitution more than under the $\Sigma_{\mathrm{K3ST}}$ group (cf. 
Example 3). 
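
The comparison of the two groups can be reproduced with a short script. As before, this is a sketch under assumptions: the five-taxon topology `((1, 2), (3, (4, 5)))` is hypothetical (it is consistent with the counts reported in Examples 3 and 4), the cyclic group is generated from the 4-cycle $A \to G \to T \to C \to A$, and the root state is enforced by counting one extra substitution when A is not among the optimal Fitch states.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass; returns (optimal state set, score)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_score = fitch(tree[0], states)
    right_set, right_score = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_score + right_score
    return left_set | right_set, left_score + right_score + 1

def powers_of_cycle(cycle):
    """Identity and successive powers of a cycle given as a string, e.g. 'AGTC'."""
    step = dict(zip(cycle, cycle[1:] + cycle[0]))
    perms = [{x: x for x in cycle}]
    for _ in range(len(cycle) - 1):
        perms.append({x: step[perms[-1][x]] for x in cycle})
    return perms

def min_substitutions(tree, f, fd, group, root_state="A"):
    """Algorithm 1 with the root fixed to the constant state."""
    types = [next(i for i, perm in enumerate(group) if perm[a] == b)
             for a, b in zip(f, fd)]
    h = [group[i][root_state] for i in types]
    root_set, m = fitch(tree, dict(enumerate(h, start=1)))
    return m + (root_state not in root_set)

K3ST = [dict(zip("ACGT", img)) for img in ("ACGT", "GTAC", "CATG", "TGCA")]
Z4 = powers_of_cycle("AGTC")            # s'_0, s'_1, s'_2, s'_3

tree = ((1, 2), (3, (4, 5)))            # hypothetical topology
print(min_substitutions(tree, "GTAGA", "ACCTC", K3ST),
      min_substitutions(tree, "GTAGA", "ACCTC", Z4))   # 2 3
```

Under the cyclic group the types needed at the leaves alternate between $s'_3$ and $s'_1$, so no pair of sister leaves in this topology can share a substitution on a common ancestral edge.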

Note that variation of the minimum number of substitutions 
needed to translate a character into another one 
between different groups is not surprising: As different 
substitution types are needed to translate one pattern into 
the other one, depending solely on the underlying group, 
one group might need the same substitution type for some 
neighboring branches in the tree and another group different 
ones. Informally speaking, this would imply that in the 
first case, the substitution could be "pulled up" by the Fitch 
algorithm to happen on an ancestral branch, whereas in 
the second case this would not be possible. 

Figure 3 Converting one character into another character using the cyclic group. (a) depicts the cyclic group $\Sigma_{\mathbb{Z}_4}$, which consists of the identity 
$s'_0 = s_0$ and the three substitution types $s'_1, s'_2, s'_3$ for nucleotide character states. (b) In order to convert the character GTAGA into ACCTC using this 
group, we need to introduce the operation $s' := (s'_3, s'_1, s'_3, s'_1, s'_3)$. As the operation $s'$ also transforms the constant character AAAAA to CGCGC, 
converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the tree such that the character CGCGC is attained at 
the leaves. The Fitch algorithm applied to the latter produces a unique most parsimonious solution of three substitutions as depicted by (c). 

The link between substitution models and permutation 
matrices 

In Examples 1 and 2 we have shown that the K3ST sub- 
stitution model can be included into our framework. The 
connection between the Klein-Four-group and the K3ST 
model (as well as the one between the Z2 group and sym- 
metric 2-state model) were described in-depth in [9]. This 
section aims at discussing alternative models and how to 
identify their use (or lack thereof) for our approach. 

Most substitution models assume the independence of 
the different branches of a tree to compute the joint probability 
of the characters in $C^n$. Therefore, they use the probabilities 
for substitutions among the character states in $C$ 
along the edges of the tree $T$. We now establish a probabilistic 
link between $\Sigma^T$ and $C^n$. This link is provided by 
Birkhoff's theorem: 

Theorem 2 (Birkhoff's theorem, e.g., [20], Theorem 
8.7.1). A matrix $M$ is doubly stochastic, i.e., each column 
and each row of $M$ sum to 1, if and only if for some $N < \infty$ 
there are permutation matrices $\sigma_1, \ldots, \sigma_N$ and positive 
scalars $\alpha_1, \ldots, \alpha_N \in [0, 1]$ such that $\alpha_1 + \cdots + \alpha_N = 1$ and 

$$M = \alpha_1 \sigma_1 + \cdots + \alpha_N \sigma_N.$$

Therefore, the weighted sum of the permutation matrices 
in $\Sigma^T$ yields a doubly stochastic matrix $M_T$ as introduced 
above. $M_T$ also describes a random walk on $C^n$ 
governed by $T$ where the single step in $C^n$ is illustrated by 
the associated Cayley graph. Its stationary distribution is 
uniform, i.e. when we throw sufficiently many mutations 
on $T$ then we expect to see each character with probability 
$1/r^n$. 

Another, even more useful consequence of Birkhoff's 
theorem is the fact that it tells us which substitution models 
are suited for the OSM approach. If the transition 
matrix associated with the substitution model is doubly 
stochastic, then we find a set of permutations which give 
rise to the model. 

Let us see how this influences the symmetric form of 
the general time reversible model (sGTR) with uniform 
stationary distribution. It has the transition probability 
matrix 

$$P_{\mathrm{sGTR}} = \begin{pmatrix} 1 - a - b - c & a & b & c \\ a & 1 - a - d - e & d & e \\ b & d & 1 - b - d - f & f \\ c & e & f & 1 - c - e - f \end{pmatrix}.$$

Assigning permutation matrices to the respective parameters 
yields the set $\Sigma_{\mathrm{sGTR}}$ with elements $s_0$ (identity) 
and $s_a, \ldots, s_f$, where each of the latter is the permutation matrix 
that exchanges the two states linked by the corresponding parameter 
and fixes the other two, e.g., 

$$s_a = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$

The weighted sum of the non-identity elements yields 

$$a s_a + b s_b + c s_c + d s_d + e s_e + f s_f = \begin{pmatrix} d + e + f & a & b & c \\ a & b + c + f & d & e \\ b & d & a + c + e & f \\ c & e & f & a + b + d \end{pmatrix},$$

which is equal to $P_{\mathrm{sGTR}}$ if $a + b + c + d + e + f = 1$. Thus, 
the set $\Sigma_{\mathrm{sGTR}}$ is to sGTR what $\Sigma_{\mathrm{K3ST}}$ is to K3ST. However, 
$\Sigma_{\mathrm{sGTR}}$ does not satisfy condition C5, because $s_a, \ldots, s_f$ are 
not fixed point free. This can be seen as the main diagonal 
of $s_a, \ldots, s_f$ does not only contain zeros. It is also not 
commutative (condition C4) as e.g. $s_a \cdot s_c \neq s_c \cdot s_a$. And 
it is not closed under matrix multiplication (condition 
C3), which means that a concatenation of permutations in 
$\Sigma_{\mathrm{sGTR}}$ might lead to a new permutation not in $\Sigma_{\mathrm{sGTR}}$, e.g., 
$s_a \cdot s_f \notin \Sigma_{\mathrm{sGTR}}$. Other complex models like Tamura-Nei 
[21] do not even permit the decomposition of their transition 
matrices into the convex sum of permutation matrices. 
All of this shows why the overall applicability of complex 
models to the OSM approach is rather limited. 
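
The statements about the sGTR permutation set can be verified numerically. The following is a minimal sketch, assuming NumPy; it takes $s_a, \ldots, s_f$ to be the six transposition matrices on four states, in the order implied by the weighted sum above, and uses made-up weights that sum to one.

```python
import itertools
import numpy as np

def transposition(i, j, r=4):
    """Permutation matrix exchanging states i and j and fixing the rest."""
    m = np.eye(r, dtype=int)
    m[[i, j]] = m[[j, i]]
    return m

# s_a, ..., s_f: exchanges (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
pairs = itertools.combinations(range(4), 2)
s = dict(zip("abcdef", (transposition(i, j) for i, j in pairs)))

# Hypothetical weights a, ..., f, chosen only for illustration (sum to 1).
w = dict(a=0.3, b=0.2, c=0.1, d=0.15, e=0.05, f=0.2)
P = sum(w[k] * s[k] for k in "abcdef")

# The convex combination is doubly stochastic, as Birkhoff's theorem states.
print(np.allclose(P.sum(axis=0), 1), np.allclose(P.sum(axis=1), 1))

# C5 fails: every transposition fixes two states.
print([int(np.trace(m)) for m in s.values()])
# C4 fails: s_a and s_c do not commute.
print(np.array_equal(s["a"] @ s["c"], s["c"] @ s["a"]))
# C3 fails: s_a . s_f is neither the identity nor one of s_a, ..., s_f.
members = [np.eye(4, dtype=int)] + list(s.values())
print(any(np.array_equal(s["a"] @ s["f"], m) for m in members))
```

The first entry of `P` equals $d + e + f$, matching the diagonal of the weighted sum, while the last three checks show that this set violates conditions C5, C4 and C3 respectively.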
There are other approaches to describe phylogenetic 
models based on the group structure of their substitution 
matrices. In particular, Sumner et al. [22] use Lie algebra 
to construct OSM type matrices for the general Markov 
model, and discuss shortcomings of the group structure 
for the general GTR model [23]. 

Application to other biologically interesting sets 

As stated above, OSM-type models require an underly- 
ing Abelian group. Thus, the OSM setting is applicable 
not only to binary data or four-state (DNA or RNA) 
data, but also to alphabets of 16 (doublets), 64 (codons), 
and 20 characters (amino acids) respectively. We compare 
such extensions to existing biologically motivated binning 
approaches and discuss their relevance. 

As we have shown in the previous sections, the symmetric 
form of the Klein four-group $\mathbb{Z}_2 \times \mathbb{Z}_2$ is mathematically 
beautiful, computationally convenient and 
biologically relevant. Similar statements can be made 
about all powers of $\mathbb{Z}_2$, including the biologically relevant 
alphabets of 16 (doublets) and 64 (codons) letters. 

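There are four Abelian group constructions for twenty-state
alphabets, namely Z2 × Z2 × Z5, Z4 × Z5, Z2 × Z10, and the cyclic
group Z20 (see e.g., [24] for a complete list of all groups
with up to 35 elements). As abstract groups, Z4 × Z5 is
isomorphic to Z20 and Z2 × Z2 × Z5 is isomorphic to Z2 × Z10,
because the orders of the merged factors are coprime; the
factorizations differ in the subgroups, and hence the groupings
of states, that they make explicit. Their construction is analogous
to the construction of the Klein-Four-group in Example 1.
For example, the elements of Z4 × Z5 are Kronecker products
of one of the four permutations in the cyclic group Z4
with one of the five permutations of the cyclic group Z5.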
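
The Kronecker construction can be sketched in a few lines. The example below represents each permutation matrix by the permutation it encodes and uses the fact that the Kronecker product of two permutation matrices acts on the pair of states (i, j), stored here as the index i * m + j. The function names and this index encoding are assumptions made for illustration. The sketch builds all four factorizations, confirms that each yields 20 commuting permutations closed under composition, and reports the largest element order, which is 20 for Z4 × Z5 and Z20 but only 10 for Z2 × Z2 × Z5 and Z2 × Z10.

```python
from itertools import product

def shift(n, k):
    """Element k of the cyclic group Z_n as a permutation i -> (i + k) mod n."""
    return tuple((i + k) % n for i in range(n))

def kron_perm(p, q):
    """Permutation whose matrix is the Kronecker product of the matrices of p and q.

    The pair of states (i, j) is encoded as the single index i * len(q) + j.
    """
    m = len(q)
    return tuple(p[i] * m + q[j] for i in range(len(p)) for j in range(m))

def product_group(orders):
    """All Kronecker products with one cyclic shift per factor in `orders`."""
    group = []
    for ks in product(*(range(n) for n in orders)):
        perm = (0,)  # trivial permutation on a single state
        for n, k in zip(orders, ks):
            perm = kron_perm(perm, shift(n, k))
        group.append(perm)
    return group

def compose(p, q):
    """Composition p after q."""
    return tuple(p[q[i]] for i in range(len(q)))

def is_abelian_group(group):
    """Closed under composition and commutative (identity is included by construction)."""
    elems = set(group)
    return all(compose(p, q) in elems and compose(p, q) == compose(q, p)
               for p in group for q in group)

def element_order(p):
    """Smallest n >= 1 such that p composed with itself n times is the identity."""
    identity, power, n = tuple(range(len(p))), p, 1
    while power != identity:
        power, n = compose(power, p), n + 1
    return n

summary = {}
for orders in [(2, 2, 5), (4, 5), (2, 10), (20,)]:
    G = product_group(orders)
    summary[orders] = (len(set(G)), is_abelian_group(G),
                       max(element_order(p) for p in G))
    print(orders, *summary[orders])
```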

Figure 4 shows a heat-map type visualization of an 
OSM-type matrix on a single-leaf tree where the color- 
ing of the cells corresponds to the weights given to the 
20 permutations in the respective groups. We see that the 
coloring pattern nicely reflects the four cosets of the subgroup
Z5 in Z2 × Z2 × Z5. This can also be interpreted
as a binning of the 20 states in the underlying alphabet 
into four sets of five elements each. If the weighting cor- 
responds to a convex combination of operations, then the 
visualized matrix is doubly stochastic. 
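
A minimal sketch of this weighting for Z2 × Z2 × Z5 is given below. The weights are hypothetical and chosen to depend only on the coset of Z5, so that the resulting matrix consists of constant 5 × 5 blocks; the state encoding and function names are likewise assumptions. Exact fractions are used so that the row and column sums can be compared to 1 without rounding issues.

```python
from fractions import Fraction
from itertools import product

ORDERS = (2, 2, 5)  # the factors of Z2 x Z2 x Z5
elements = list(product(*(range(n) for n in ORDERS)))  # 20 group elements / states

def encode(x):
    """Map a triple of residues (a, b, c) to a state index 0..19."""
    a, b, c = x
    return (a * 2 + b) * 5 + c

def act(g, x):
    """Group element g acts on state x by componentwise modular addition."""
    return tuple((gi + xi) % n for gi, xi, n in zip(g, x, ORDERS))

def weighted_sum(weights):
    """Weighted sum of the 20 permutation matrices; `weights` maps element -> weight."""
    M = [[Fraction(0)] * 20 for _ in range(20)]
    for g, w in weights.items():
        for x in elements:
            M[encode(x)][encode(act(g, x))] += w
    return M

# Hypothetical weights: one value per coset of Z5, identified by the Z2 x Z2 part.
coset_weight = {(0, 0): Fraction(1, 10), (0, 1): Fraction(1, 20),
                (1, 0): Fraction(1, 40), (1, 1): Fraction(1, 40)}
weights = {g: coset_weight[g[:2]] for g in elements}
assert sum(weights.values()) == 1  # convex combination

M = weighted_sum(weights)
rows_ok = all(sum(row) == 1 for row in M)
cols_ok = all(sum(M[i][j] for i in range(20)) == 1 for j in range(20))
blocks_constant = all(M[i][j] == M[5 * (i // 5)][5 * (j // 5)]
                      for i in range(20) for j in range(20))
print(rows_ok, cols_ok, blocks_constant)
```

Because exactly one group element maps a given state to any other given state, each cell of the matrix equals the weight of a single permutation, which is why a heat map of the matrix displays the weights directly.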

Binnings are also done for amino acids, using either biochemical
properties or evolutionary divergence. An example
of a biochemical binning is the hydrophobic index,
where the 20 amino acids are binned into four groups,
very hydrophobic, hydrophobic, neutral, and hydrophilic.
Unfortunately, this binning does not correspond to any
of the proposed Abelian groups. Moreover, it is difficult
to derive transitions between these groups just from the
biochemical properties.

Figure 4 Matrices illustrate the four Abelian groups for a twenty-state alphabet: (a) the Z2 × Z2 × Z5 group, (b) Z4 × Z5, (c) Z2 × Z10, and
(d) Z20. Each matrix visualizes the cosets of the subgroups of the depicted group and suggests an associated grouping of the 20 states.

Transition matrices for evolutionary models for amino 
acid substitutions are usually generated by counting 
mutation types in the alignments (see, e.g., [25] for an 
overview). From these, optimal groupings can be obtained 
using clustering approaches [26]. The existence of esti- 
mates for the transition probability between all amino 
acids provides the possibility to get further information 
about between-group operations. These groupings could 
be forced to fit Abelian groups. However, as indicated in
[26], a grouping into four groups of five amino acids each
is rarely optimal.

Conclusions 

In this paper, we provide the necessary mathematical
background for the OSM setting, which was introduced
and used previously [4,19] but had not been analyzed
mathematically for more than two character states. Moreover,
the present paper delivers new insight concerning
the requirements for the OSM model to work: we were
able to show that, mathematically, it is sufficient
to have an underlying Abelian group. This yields a
generalization of the OSM concept that was previously
believed to be impossible [4]. Consequently, OSM
is applicable to any number of states.

However, note that the original intuition of the authors 
in [4] was biologically motivated: The authors supposed 
that the group not only has to be Abelian, but also sym- 
metric in the sense that each operation can be undone 
by being applied a second time. Thinking about DNA, for 
instance, this works: For example, the transition from A 
to G can be reverted by another substitution of the same 
type, namely a transition from G to A. This symmetry 
condition is fulfilled by the Klein-Four-group, but not by 
the cyclic group on four states. 
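
This symmetry condition can be illustrated with a short check. The sketch below assumes the state order (A, G, C, T) and encodes each group element as a permutation of the indices 0 to 3; both choices are illustrative and not taken from the text.

```python
def compose(p, q):
    """Composition p after q: first apply q, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

IDENTITY = (0, 1, 2, 3)

# Klein four-group with the assumed state order (A, G, C, T):
# (1, 0, 3, 2) swaps A<->G and C<->T, i.e. it performs the transitions.
KLEIN = [IDENTITY, (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]

# Cyclic group Z4: shifts of the four states by k positions.
CYCLIC = [tuple((i + k) % 4 for i in range(4)) for k in range(4)]

def every_operation_undoes_itself(group):
    """True if applying any operation twice returns every state to itself."""
    return all(compose(g, g) == IDENTITY for g in group)

print(every_operation_undoes_itself(KLEIN))   # True
print(every_operation_undoes_itself(CYCLIC))  # False: a shift by 1 needs four steps
```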

While the OSM approach can be extended to any number
of states, its biological relevance becomes somewhat
obscure when there is no corresponding group which
is a power of Z2. In particular, none of the four
Abelian group constructions for 20 states fits a biologically
meaningful binning of the 20 amino acids.

Competing interests 

The authors declare no competing interests.

Authors' contributions

All authors contributed equally. All authors read and approved the final 
manuscript. 

Acknowledgements 

SK thanks Marston Conder for fruitful discussions on the group theoretical 
background and Jessica Leigh for enlightening discussions on biochemical 
and evolutionary binnings of amino acids. This work is financially supported by 
the Wiener Wissenschafts-, Forschungs- und Technologiefonds (WWTF). AvH
also acknowledges the funding from the DFG Deep Metazoan Phylogeny
project, SPP (HA1628/9) and the support from the Austrian GEN-AU project
Bioinformatics Integration Network III. 



Author details 

1 Department for Mathematics and Computer Science,
Ernst-Moritz-Arndt-University Greifswald, Walther-Rathenau-Strasse 47, 17487
Greifswald, Germany. 2 Department of Statistics and School of Biological
Sciences, University of Auckland, Private Bag 92019, Auckland, New Zealand.
3 Groningen Bioinformatics Centre, University of Groningen, Nijenborgh 7,
9747 AG Groningen, The Netherlands. 4 Center for Integrative Bioinformatics
Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University of
Vienna, University of Veterinary Medicine Vienna, Dr. Bohr-Gasse 9, A-1030
Vienna, Austria.

Received: 17 October 2011 Accepted: 10 December 2012
Published: 15 December 2012

References 

1. Durbin R, Eddy SR, Krogh A, Mitchison G: Biological sequence analysis -
Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge
University Press; 1998.

2. Mount DW: Bioinformatics: Sequence and Genome Analysis. New York: Cold
Spring Harbor; 2004.

3. Semple C, Steel M: Phylogenetics. New York: Oxford University Press; 2003.

4. Nguyen MAT, Klaere S, von Haeseler A: MISFITS: evaluating the
goodness of fit between a phylogenetic model and an alignment.
Mol Biol Evol 2011, 28:143-152.

5. Klaere S, Gesell T, von Haeseler A: The impact of single substitutions on
multiple sequence alignments. Philos T R Soc B 2008, 363:4041-4047.

6. Kimura M: Estimation of Evolutionary Distances between
Homologous Nucleotide Sequences. P Natl Acad Sci USA 1981,
78:454-458.

7. Fitch WM: Toward defining the course of evolution: Minimum
change for a specific tree topology. Syst Zool 1971, 20:406-416.

8. Humphreys JF: A course in group theory. New York: Oxford University Press;
1996.

9. Hendy M, Penny D, Steel M: A discrete Fourier analysis for
evolutionary trees. P Natl Acad Sci USA 1994, 91:3339-3343.

10. Bashford JD, Jarvis PD, Sumner JG, Steel MA: U(1) × U(1) × U(1)
symmetry of the Kimura 3ST model and phylogenetic branching
processes. J Phys A: Math Gen 2004, 37:L81-L89.

11. Bryant D: Hadamard Phylogenetic Methods and the n-taxon process.
Bull Math Biol 2009, 71(2):339-351.

12. Hendy MD, Penny D: A framework for the quantitative study of
evolutionary trees. Syst Zool 1989, 38(4):297-309.

13. MacLane S, Birkhoff G: Algebra. Chelsea: American Mathematical Society;
1999.

14. Bryant D: Extending Tree Models to Split Networks. In Algebraic
Statistics for Computational Biology. Edited by Pachter L, Sturmfels B.
Cambridge: Cambridge University Press; 2005.

15. Kimura M: A Simple Method for Estimating Evolutionary Rates of
Base Substitutions through Comparative Studies of Nucleotide
Sequences. J Mol Evol 1980, 16:111-120.

16. Horn RA, Johnson CR: Topics in matrix analysis. New York: Oxford
University Press; 1991.

17. Steeb WH, Hardy Y: Matrix Calculus and Kronecker Product: A Practical
Approach to Linear and Multilinear Algebra. 2nd ed. Singapore: World
Scientific Publishing; 2011.

18. Cayley A: Desiderata and Suggestions: No. 2. The Theory of Groups:
Graphical Representation. Am J Math 1878, 1(2):174-176.

19. Nguyen MAT, Gesell T, von Haeseler A: ImOSM: Intermittent Evolution
and Robustness of Phylogenetic Methods. Mol Biol Evol 2012,
29(2):663-673.

20. Horn RA, Johnson CR: Matrix analysis. Cambridge: Cambridge University
Press; 1990.

21. Tamura K, Nei M: Estimation of the number of nucleotide
substitutions in the control region of mitochondrial DNA in humans
and chimpanzees. Mol Biol Evol 1993, 10(3):512-526.

22. Sumner JG, Holland BR, Jarvis PD: The Algebra of the General Markov
Model on Phylogenetic Trees and Networks. Bull Math Biol 2012,
74(4):858-880.

23. Sumner JG, Jarvis PD, Fernandez-Sanchez J, Kaine B, Woodhams M,
Holland BR: Is the general time-reversible model bad for molecular
phylogenetics? Syst Biol 2012, 61(6):1069-1074.






24. Keilen T: Endliche Gruppen. Eine Einführung mit dem Ziel der
Klassifikation von Gruppen kleiner Ordnung. 2000.
[http://www.mathematik.uni-kl.de/wwwagag/download/scripts/Endliche.Gruppen.pdf]

25. Kosiol C, Goldman N: Different Versions of the Dayhoff Rate Matrix.
Mol Biol Evol 2005, 22(2):193-199.

26. Susko E, Roger AJ: On Reduced Amino Acid Alphabets for
Phylogenetic Inference. Mol Biol Evol 2007, 24(9):2139-2150.



doi:10.1186/1748-7188-7-36

Cite this article as: Fischer et al.: On the group theoretical background of
assigning stepwise mutations onto phylogenies. Algorithms for Molecular
Biology 2012, 7:36.








